A German Corpus for Similarity Detection Tasks

Authors

Juan-Manuel Torres-Moreno

Gerardo Sierra

Peter Peinl

Abstract

Text similarity detection measures the degree of similarity between a pair of texts. Existing corpora for similarity detection are designed to evaluate algorithms that assess the level of paraphrase among documents. In this paper we present a German text corpus for similarity detection. The corpus is intended for automatically assessing the similarity between a pair of texts and for evaluating different similarity measures, either on whole documents or on individual sentences. To that end, we have computed several simple measures on the corpus using a library of similarity functions.
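To make the idea of "simple measures" concrete, the following is a minimal sketch of two common ones, Jaccard overlap and cosine similarity over word counts, applied to a pair of German sentences. The abstract does not name the measures or the library the authors used, so the function names (`tokenize`, `jaccard`, `cosine`), the tokenization rule, and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters.

    In Python 3, \\w is Unicode-aware for str patterns, so German
    umlauts and the sharp s stay inside their tokens.
    """
    return re.findall(r"\w+", text.lower())


def jaccard(text_a, text_b):
    """Jaccard similarity: shared word types divided by all word types."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


# Hypothetical sentence pair, not taken from the corpus.
s1 = "Der Hund schläft im Garten"
s2 = "Der Hund spielt im Garten"
print(f"Jaccard: {jaccard(s1, s2):.3f}")
print(f"Cosine:  {cosine(s1, s2):.3f}")
```

The same functions work unchanged on whole documents, since they only depend on the token lists. Measures of this kind ignore word order and inflection, so a real evaluation would probably add stemming or n-gram variants.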

Related Resources

A German Corpus for Text Similarity Detection Tasks

Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate the algorithms to assess the paraphrase level among documents. In this paper we present a textual German corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts...

GerNED: A German Corpus for Named Entity Disambiguation

Determining the real-world referents for name mentions of persons, organizations and other named entities in texts has become an important task in many information retrieval scenarios and is referred to as Named Entity Disambiguation (NED). While comprehensive datasets support the development and evaluation of NED approaches for English, there are no public datasets to assess NED systems for ot...

Injecting Word Embeddings with Another Language's Resource : An Application of Bilingual Embeddings

Word embeddings learned from text corpus can be improved by injecting knowledge from external resources, while at the same time also specializing them for similarity or relatedness. These knowledge resources (like WordNet, Paraphrase Database) may not exist for all languages. In this work we introduce a method to inject word embeddings of a language with knowledge resource of another language b...

Using Web Corpora for the Automatic Acquisition of Lexical-Semantic Knowledge

This article presents two case studies to explore whether and how web corpora can be used to automatically acquire lexical-semantic knowledge from distributional information. For this purpose, we compare three German web corpora and a traditional newspaper corpus on modelling two types of semantic relatedness: (1) Assuming that free word associations are semantically related to their stimuli, w...

Monolingual Text Similarity Measures: A Comparison of Models over Wikipedia Articles Revisions

Measuring the similarity of texts is a common task in detection of co-derivatives, plagiarism and information flow. In general the objective is to locate those fragments of a document that are derived from another text. We have carried out an exhaustive comparison of similarity estimation models in order to determine which one performs better on different levels of granularity and languages (En...

Journal:

Int. J. Comput. Linguistics Appl.

Volume 5

Publication date: 2014
